[57] ABSTRACT

An apparatus and method for identifying one of a plurality 
of documents stored in a computer-readable medium are 
disclosed. The method includes the steps of prompting a 
computer-user to construct a search expression, then com- 
municating the search expression to each of a plurality of 
search engines located at respective World Wide Web sites. 
Each of the plurality of search engines is prompted to 
concurrently identify a respective plurality of web pages 
containing text consistent with the search expression and to 
return a respective URL for each such web page identified. 
Redundant URLs returned by the search engines are filtered 
to obtain an initial set of web pages. Each of the initial set 
of web pages is downloaded and linguistically analyzed to 
automatically identify for the computer-user keyword 
phrases therein. The computer-user is prompted to construct 
a query expression in which one or more keyword phrases 
from the initial set of web pages is an operand. The query 
expression is then used to identify at least one web page of 
the initial set of web pages and the identified web page is 
presented to the user in the form of an abstract. 

16 Claims, 22 Drawing Sheets 
BROWSE BY PROMPTED KEYWORD PHRASES WITH AN IMPROVED METHOD
FOR OBTAINING AN INITIAL DOCUMENT SET

CROSS-REFERENCE TO RELATED APPLICATIONS

This is a continuation-in-part of application Ser. No.
08/687,656, now U.S. Pat. No. 5,721,897, filed Jul. 26, 1996,
which is a continuation-in-part of application Ser. No.
08/628,098, now U.S. Pat. No. 5,794,233, filed Apr. 9, 1996.

BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention relates to the field of computerized
document management. More specifically, the present
invention relates to a method and apparatus for obtaining an
initial set of documents and then identifying one of the initial
set of documents by permitting a computer user to browse
the documents by prompted keyword phrases using an
improved user interface.

2. Art Background

In modern computer application programs, such as
commercially available word processor programs, a user
choosing to open a data file is typically provided with a list
of data files contained in the active directory or folder and
prompted to select one. The process of selecting a data file
varies based on the user's foreknowledge of the data file
sought, and generally falls into one of four cases. First, if the
user knows the name of the file sought and the filename is
listed, the user simply selects that file. Second, if the user
does not know the filename but knows the general nature of
the subject matter sought, the user may still be able to select
the file of interest on the basis of its filename. In this case,
the user may have to open and examine the content of several
files having filenames related to the subject of interest before
opening a satisfactory file. If, in a third case, the user doesn't
know the name of the file sought or even the general nature
of the subject matter sought, but seeks a file referencing or
discussing a specific word or phrase, the user may need to
open each of the files in turn and perform either a manual or
automated search for the "keyword phrase" of interest. File
by file search for keyword phrases can be time consuming
and tedious, particularly if there are a large number of files.
In most instances, consequently, the search for keyword
phrases within files can be automated either by application
program or by operating system utility (the former being
exemplified by search features commonly provided by word
processors, the latter by the UNIX grep utility). In the fourth
and final case, if the user doesn't know the filename, subject
matter or even keyword phrases sought, but simply wishes
to browse the documents until something of interest appears,
the user must do this on a file by file basis.

The Internet presents a similar content discovery problem,
but on a much larger scale. On the World Wide Web (the
"web"), the graphical portion of the Internet, an enormous
number of documents referred to as "web pages" are linked
together through Hypertext Markup Language (HTML)
constructs to form a single searchable data object. A search
engine, itself located at an Internet site, can be used to
identify web pages containing a user-specified expression in
a manner analogous to the way a UNIX grep utility can be
used to locate search expressions within local files.
Searching for data on the web using a search engine presents
at least two problems, however. First, due to the volume of
traffic on the web, searching can be slow. Second, once an
initial set of web pages has been identified by the search
engine, the user is still faced with the content discovery
problem described above. Namely, unless the user already
knows the exact web page sought, the user may have to
supply additional search terms to reduce the number of web
pages in the initial set or, in the worst case, browse the initial
set of web pages one after the other until something of
interest appears.

It would be desirable to allow the user to browse local
files or web pages by extracting the essential concepts of the
local files or web pages and presenting them to the user in
the form of an abstract. Furthermore, it would be desirable
to relieve the user of the burden of conceiving search terms
by automatically identifying keyword phrases in the initial
set of local files or web pages and presenting them to the user
as [...] to identify a document. The user could then select one
or more of the keyword phrases, join them in a query
expression and allow the [...] to identify one or more local
files or web pages most nearly satisfying the logical
combination of keyword phrases. Also, it would be desirable
to more rapidly and comprehensively search the World Wide
Web to locate an initial set of web pages containing a
user-specified search expression. These and other benefits
are achieved by the method and apparatus of the present
invention.

SUMMARY OF THE INVENTION

A method and apparatus for identifying one of a plurality
of documents stored in a computer-readable medium are
disclosed. The method allows a computer user to browse the
plurality of documents by prompting the user to construct a
query expression from an automatically generated list of
keyword phrases. Once selected by the user, the query
expression is used to identify one of the plurality of
documents and an abstract of the identified document is
presented to the user. Identification of the keyword phrases
and generation of the abstract is performed by linguistically
analyzing the documents. The method of the present
invention includes the steps of automatically identifying for
a user keyword phrases in the plurality of documents,
prompting the user to construct a query expression in which
at least one of the keyword phrases is an operand, and
identifying one of the plurality of documents based on the
query expression.

In addition, an improved user interface provides the
capability to display either or both key words and key
phrases on the display screen in separately scrollable display
areas. These separately scrollable display areas are
dynamically sized to render visible the selected text. A set of
dynamically created tabs in a tabbed index provide a means
to index into the content of each display area. The font of the
selected and displayed text is dynamically set to maximize
the display area. The plurality of documents from which key
words or key phrases are taken may be documents from a
computer network, including web pages from the World
Wide Web, or documents from a local hard-drive. A concept
editor allows key words or key phrases to be grouped under
a concept identifier and used in document search queries.

A method and apparatus is disclosed for identifying one of
a plurality of documents stored in a computer-readable
medium, the method comprising the computer-implemented
steps of: 1) automatically identifying for a user keyword
phrases in the plurality of documents; 2) displaying a tabbed
index indicative of content of the keyword phrases; 3)
prompting the user to construct a query expression in which
at least one of the keyword phrases is an operand; and 4)
identifying one of the plurality of documents based on the
query expression.
An improved method for searching the World Wide Web
to identify an initial set of documents is also disclosed. The 
computer-user is prompted to enter a search expression that 
can be used to identify the initial set of documents and the 
search expression is communicated to a plurality of Internet 
search engines. The search engines are prompted to concur- 
rently inspect a respective plurality of web pages and return 
the URLs of web pages containing text consistent with the 
search expression. Redundantly returned URLs are filtered 
so that a non-redundant initial set of web pages is identified
from which an automatically generated list of keyword 
phrases can be extracted. The list of keyword phrases can 
then be used to prompt the user to construct a query 
expression as described above. 
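
The concurrent search and the filtering of redundant URLs described above can be sketched as follows. This is a minimal illustration and not the disclosed implementation: the `SearchEngine` callable type, the `normalize` rule (trimming whitespace and a trailing slash) and the two stand-in engines are assumptions introduced here so that the example runs without network access.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable

# Hypothetical stand-in: a "search engine" is any callable that maps a
# search expression to a list of URLs. Real engines would be queried over
# the network.
SearchEngine = Callable[[str], list[str]]


def normalize(url: str) -> str:
    """Crude canonical form used to spot redundant URLs (an assumption)."""
    return url.strip().rstrip("/")


def initial_set(expression: str, engines: Iterable[SearchEngine]) -> list[str]:
    """Query every engine concurrently and drop redundantly returned URLs."""
    engine_list = list(engines)
    with ThreadPoolExecutor(max_workers=max(1, len(engine_list))) as pool:
        # map() yields results in engine order, so the output is stable.
        result_lists = list(pool.map(lambda engine: engine(expression), engine_list))

    seen: set[str] = set()
    unique: list[str] = []
    for urls in result_lists:
        for url in urls:
            key = normalize(url)
            if key not in seen:
                seen.add(key)
                unique.append(key)
    return unique


def engine_a(expression: str) -> list[str]:
    """Stand-in engine returning fixed URLs for illustration."""
    return ["http://example.com/dogs", "http://example.org/guard-dogs"]


def engine_b(expression: str) -> list[str]:
    """Stand-in engine whose first URL duplicates one from engine_a."""
    return ["http://example.com/dogs/", "http://example.net/police-dogs"]


if __name__ == "__main__":
    for url in initial_set("guard dog", [engine_a, engine_b]):
        print(url)
```

The first URL returned by `engine_b` differs from one returned by `engine_a` only by a trailing slash, so it is treated as redundant and the initial set holds three URLs rather than four.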

BRIEF DESCRIPTION OF THE DRAWING 

The features and advantages of the present invention will 
be more fully understood by reference to the accompanying 
drawing, in which: 

FIG. 1 illustrates a method according to the present 
invention. 

FIG. 2 depicts one embodiment of a user-interface accord- 
ing to the present invention. 

FIG. 3 depicts a search pane used to construct a query 
expression. 

FIG. 4 illustrates a general purpose computer utilized to 
perform the method steps of the present invention. 

FIG. 5 depicts one embodiment of an improved user- 
interface showing a keyword and a key phrase window pane 
with dynamic index tabs. 

FIG. 6 depicts one embodiment of the improved user- 
interface showing WWW web pages. 

FIG. 7 depicts one embodiment of the improved user- 
interface showing the concept editor of the present inven- 
tion. 

FIG. 8 illustrates a method for identifying one of a 
plurality of web pages on the World Wide Web. 

FIG. 9 depicts a Control window used to display a search 
expression constructed by a computer-user. 

FIG. 10 depicts a Contents View window used to display 
URLs returned by web searching engines. 

FIG. 11 depicts a Phrases View window used to display 
keyword phrases obtained by linguistically analyzing each 
of an initial set of web pages. 

FIG. 12 depicts a Words View window used to display 
keywords obtained by linguistically analyzing each of an 
initial set of web pages. 

FIG. 13 depicts a Links View window used to display 
search expressions, search engine expressions and web page 
URLs. 

FIG. 14 depicts a Discards View window used to display 
the URLs of web pages in the initial set of web pages that 
were not available for download. 

FIG. 15 depicts an Abstract window used to display an 
abstract of a web page. 

FIG. 16 depicts a Quick Setup options window used to 
allow a computer-user to specify characteristics of a host 
computer. 

FIG. 17 depicts a Search options window used to allow a 
computer-user to specify the web searching engines to be 
used to identify an initial set of web pages. 

FIG. 18 is a block diagram of an application program 
according to one embodiment of the present invention. 
FIG. 19 is an execution diagram for a user-interface.

FIG. 20 is an execution diagram for procedure GenerateWorkList.

FIG. 21 is an execution diagram for procedure StartWork.

FIG. 22 is an execution diagram for a web agent.

DETAILED DESCRIPTION OF THE 
INVENTION 

In the following detailed description of the present
invention numerous specific details are set forth in order to
provide a thorough understanding of the present invention.
However, it will be obvious to one skilled in the art that the
present invention may be practiced without these specific
details.

Overview of a Method For Identifying One of a 
Plurality of Documents 

FIG. 1 illustrates a method for identifying one of a
plurality of documents stored in a computer-readable
medium by prompting a computer user (typically a human
operator) to construct a query expression from an
automatically generated list of keyword phrases. Herein the
term document refers to a computer-readable arrangement of
data and includes ASCII and other character based files as
well as binary files having a format interpretable by an
application program. In the present invention, these
documents may be locally resident files or pages on the
World Wide Web (WWW). The web pages are stored at web
sites on the WWW and accessible using a Universal Resource
Locator (URL).

At step 110, each of the plurality of documents is
linguistically analyzed to identify keyword phrases therein,
and the identified keyword phrases are presented to the user.
A keyword phrase is a combination of two or more words
expressing a significant concept, and a document is said to
contain a keyword phrase if the keyword phrase literally
appears in the document or its basis for derivation appears
in the document. For example, a document containing the
phrase "clothing that is machine washable" contains the
keyword phrase "machine washable clothing" because, even
though "machine washable clothing" does not appear
literally in the document, the basis for deriving the keyword
phrase does. Linguistic analysis and identification of
keyword phrases is discussed further below. At step 120, the
user is prompted to construct a query expression by selecting
keyword phrases from the presented set of keyword phrases.
A query expression is a logical expression in which one or
more keyword phrases appear as operands. At step 130, one
or more of the plurality of documents is identified based on
the constructed query expression. For example, if the
plurality of documents consists of the set (A, B, C, D, E) and
the query expression is: keyword phrase 1 AND keyword
phrase 2 AND NOT keyword phrase 3, then the document
from the set (A, B, C, D, E) satisfying or most nearly
satisfying the query expression (i.e., containing keyword
phrase 1 and keyword phrase 2, but not containing keyword
phrase 3), will be identified by step 130. At step 140, an
abstract of the document is generated, and at step 150 the
document abstract is presented to the user. As will be
discussed further below, the document abstract is obtained
by linguistic analysis of the identified document to identify
key concepts therein.
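
Step 130 can be sketched for the example query "keyword phrase 1 AND keyword phrase 2 AND NOT keyword phrase 3" as follows. This is an illustrative sketch only: the `Query` class, the keyword phrases assigned to documents A through E, and the scoring rule for "most nearly satisfying" (counting how many terms of the query a document satisfies) are assumptions made here, since the text does not specify how a nearest match is chosen.

```python
from dataclasses import dataclass, field


@dataclass
class Query:
    """Keyword phrases joined by AND (include) and by AND NOT (exclude)."""
    include: list[str] = field(default_factory=list)
    exclude: list[str] = field(default_factory=list)


def satisfied_terms(phrases: set[str], query: Query) -> int:
    """Count the terms of the query that a document's phrase set satisfies."""
    hits = sum(1 for phrase in query.include if phrase in phrases)
    hits += sum(1 for phrase in query.exclude if phrase not in phrases)
    return hits


def identify(documents: dict[str, set[str]], query: Query) -> list[str]:
    """Return the documents satisfying, or most nearly satisfying, the query.

    Assumed scoring: the documents with the highest count of satisfied
    terms are returned, so an exact match always outranks a partial one.
    """
    scores = {name: satisfied_terms(phrases, query)
              for name, phrases in documents.items()}
    best = max(scores.values(), default=0)
    return sorted(name for name, score in scores.items() if score == best)


# Hypothetical keyword phrases for the documents A through E.
DOCUMENTS = {
    "A": {"keyword phrase 1", "keyword phrase 2", "keyword phrase 3"},
    "B": {"keyword phrase 1", "keyword phrase 2"},
    "C": {"keyword phrase 1"},
    "D": {"keyword phrase 3"},
    "E": set(),
}

QUERY = Query(include=["keyword phrase 1", "keyword phrase 2"],
              exclude=["keyword phrase 3"])

if __name__ == "__main__":
    print(identify(DOCUMENTS, QUERY))
```

With these assumed documents only B contains keyword phrases 1 and 2 without keyword phrase 3, so B is identified. If B were absent, A and C would tie as the nearest matches under the assumed scoring.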

First Embodiment of the User-interface

FIG. 2 depicts a user-interface 200 allowing a computer
user to identify one of a plurality of documents in
accordance with the method described above. The
user-interface 200 is presented to the user in response to a
document select request such as a request to open a data file
in a word processing or other text-intensive application. The
user may not know a priori the specific document or even the
subject matter he or she seeks. In the present invention,
therefore, the user-interface 200 includes a dialog box 201
that presents an automatically generated list of search terms,
referred to as keyword phrases, in a keyword pane 205. The
listed keyword phrases 206 act to "prompt" the user to
search for information of interest without requiring the user
to conceive search terms. Thus, the present invention
relieves the user from the burden of creating a document
search expression, and instead permits the user to browse the
relevant documents on the basis of the automatically
generated keyword phrases 206. Keyword entry pane 215 is
provided to allow the user to enter keyword phrases that do
not appear in the keyword pane 205. The keyword phrases
206 listed in the keyword pane 205 prompt the user to
construct a query expression which will be used to identify
one of the plurality of documents. Beside each keyword
phrase presented in keyword pane 205 is the relevance code
208 of the keyword phrase. Relevance codes 208 are values
indicating the importance of the keyword phrase relative to
other keyword phrases in the document. As stated above, the
keyword phrases are obtained by linguistically analyzing
each of a plurality of documents, and, in the preferred
embodiment, relevance codes are generated by the linguistic
analysis. Linguistic analysis and the relevance codes
resulting therefrom are discussed in greater detail below.

Dialog box 201 includes a file list pane 220 listing the
documents 221 to be searched. The documents to be
searched are drawn from an archive catalog; an arbitrary
collection of documents that constitute a single searchable
entity. The archive catalog open at any given time is the
archive catalog from which the keyword phrases 206 in
keyword pane 205 are drawn and the name of the open
archive catalog appears in the title bar 202 of dialog box 201.
In one embodiment, the computer user may construct and
save archive catalogs by selecting documents from a list of
documents presented by the computer operating system or
its extensions (e.g., the Apple Macintosh Finder or the
Microsoft Windows '95 Explorer). Alternatively, archive
catalogs can be created automatically from the group of
documents residing in an identified area of a computer
system's file storage such as a folder or directory. When
constructed, an archive catalog becomes the open archive
catalog and each of the documents therein appear in file list
pane 220. The user may also recall previously constructed
archive catalogs. For archive catalogs containing more
documents than can be presented in the file list pane 220 at
once, the file list pane 220 operates as a virtual window to
the complete list of documents and scrollbars allow the user
to select the viewpoint of the virtual window at points of
interest along the complete list of documents.

Dialog box 201 also includes a search pane 240 which
itself contains constituent logic panes 242 and 246. Logic
panes 242 and 246 are logical operation elements; graphic
constructs that represent logical operators. Search pane 240
prompts the user to construct a query expression by
associating keyword phrases 206 with logical operation
elements. In one embodiment, this is accomplished by
dragging one or more keyword phrases 206 from keyword
pane 205 and dropping each into one of the logic panes 242
or 246 (the physical act of moving a displayed object from
one location to another is a well known operation performed
with a cursor control device such as a mouse or trackball and
is referred to as a "drag and drop" operation). As stated
above, a query expression is a logical expression in which
one or more keyword phrases appear as operands. The act of
dropping a selected keyword phrase into one of the two logic
panes (242 or 246) within search pane 240 causes the
keyword phrase to be logically joined to the query
expression. The nature of the logical join is determined by
the logic pane (242 or 246) into which the keyword phrase is
dropped. Logic pane 242 is referred to as the "INCLUDE"
logic pane and keyword phrases dropped therein are initially
joined to the query expression by a logical AND operator.
When joined to the query expression by a logical AND
operator, a keyword phrase must be contained by the
document sought in addition to the previously formulated
query expression. The phrase "previously formulated query
expression" is used herein to refer to the query expression as
it exists prior to a drag and drop event and, if no prior drag
and drop events have occurred, the previously formulated
query expression may consist of an empty set of keyword
phrases. In one embodiment, the logical AND operator
joining a keyword phrase dropped in the INCLUDE logic
pane 242 to the query expression may be converted to a
logical OR operator by placing the mouse cursor over the
keyword phrase and depressing the right mouse button. A
menu will be presented with a selection allowing the logical
operator to be toggled between AND and OR. As will be
discussed further below in reference to FIG. 3, each keyword
phrase joined to the query expression by a logical OR
operator is associated with the nearest preceding keyword
expression joined to the query expression by a logical AND
operator. Search pane 240 also includes "NOT" logic pane
246 for specifying query expressions that are not to appear
in the document sought.

In one embodiment of the present invention, the query
expression is displayed in query pane 250 as each of its
constituent keyword phrases is selected. Query pane 250
enables the user to type a query expression or to edit a query
expression previously constructed via the drag and drop
technique described above. In this way, complex query
expressions may be specified which might be difficult or
awkward to construct using the drag and drop technique
alone. Further, query pane 250 includes a down arrow 252,
which, when selected by the user presents a history of prior
query expressions that may be recalled.

FIG. 3 depicts a search pane containing an exemplary
query expression constructed using the interface of one
embodiment of the present invention. The query expression
"(dog:security OR watchdog OR guard dog OR police dog)
AND (doberman OR german shepherd) AND NOT (pit bull)"
may be constructed from a keyword phrase list containing
the query expression's constituent keyword phrases as
follows: First, the constituent keyword phrases are selected
from the list of keyword phrases (not shown) and dropped
into INCLUDE logic pane 342 beginning with keyword
phrase "dog:security" and ending with keyword phrase
"german shepherd". At this point the query pane (not shown)
will contain the query expression "dog:security AND
watchdog AND guard dog AND police dog AND doberman
AND german shepherd". By converting the logical AND
operators corresponding to the keyword phrases "watchdog",
"guard dog", "police dog" and "german shepherd" to logical
OR operators (using the technique described above in
reference to FIG. 2), the query expression "(dog:security OR
watchdog OR guard dog OR police dog) AND (doberman OR
german shepherd)" is obtained. Since a logical OR operator
associates a keyword phrase to the nearest preceding
keyword phrase joined to the query expression by a logical AND
operator, keyword phrases "watchdog", "guard dog" and
"police dog" are logically OR'd with the keyword phrase
"dog:security" and keyword phrase "german shepherd" is
logically OR'd with the keyword phrase "doberman". In
accordance with accepted set-theory notation a single dot
adjacent a keyword phrase appearing in INCLUDE logic
pane 342 indicates that the keyword phrase is joined to the
query expression by a logical AND operator, while two dots
adjacent a keyword phrase indicate that the keyword phrase
is joined to the query expression by a logical OR operator.
Thus, of the keyword phrases dropped in logic pane 342,
"dog:security" and "doberman" have a single dot adjacent
them while the others have two dots adjacent them. After
dropping the keyword phrase "pit bull" in the NOT logic
pane 346, the desired query expression is completed. To
change the logical relationships between the selected
keyword phrases, the keyword phrases may be dragged and
dropped in different positions within search pane 340. For
example to logically OR "german shepherd" with
"dog:security" instead of with "doberman", the keyword
phrase "german shepherd" may be dragged and dropped to a
position preceding (above) "doberman".

In one embodiment of the present invention, it is possible
to group keyword phrases under "concept headings".
Concept headings are keyword phrases which serve as a
shorthand expression for each of the keyword phrases
associated with them. Thus, when a concept heading 'X'
having constituent keyword phrases 'A', 'B' and 'C' is
dropped into the INCLUDE logic pane, keyword phrases 'A',
'B' and 'C' become part of the query expression (though, in
one embodiment, only the concept heading 'X' appears in the
query pane). Furthermore, the logical association of
keyword phrases that have been grouped under a concept
heading dropped in the INCLUDE logic pane may be
specified. For example, by repositioning the constituent
keyword phrases relative to one another and by toggling
between logical AND and logical OR operators, keyword
phrases 'A', 'B' and 'C' may be related by: (A OR B) AND
C; A AND (B OR C); and so on. Concept headings may be
entered by the user or selected from the automatically
generated list of keyword phrases.

Returning to FIG. 2, in one embodiment of the present
invention, once a query expression is completely
constructed, the user initiates a document search by placing
the mouse cursor over Search button 255 and pressing a
mouse button (i.e., clicking the Search button 255). After the
search, the list of documents appearing in file list pane 220
is reduced to the subset of documents meeting the search
criteria set forth in the query expression. Alternative
embodiments, including one in which all of the documents
remained in view, but with the subset of documents meeting
the query expression indicated in some way (e.g., by
highlighting or shading), would be within the spirit and
scope of the present invention.

The document abstract pane 270 is used to present an
abstract from a document identified based on the query
expression constructed by the user. The identified document
is a document meeting the logical criteria set forth in the
query expression. In the example above, for instance, a
document having keyword phrases A and B, but not E would
be identified, as would a document having keyword phrase
D, but not E. In one embodiment of the present invention, an
abstract of the identified document is generated by first
performing linguistic analysis on the document to identify
concept sentences (i.e., sentences containing keyword
phrases) and then combining the concept sentences. In an
alternative embodiment, the document abstract is generated
by linguistic analysis of the document separate from that
used to identify keyword phrases. Keyword phrases, concept
sentences, and a document abstract may be generated in a
single linguistic analysis or in separate operations.

In one embodiment of the invention, the user may
select the document from which the abstract is presented by
clicking on any one of the documents 221 listed in file list
pane 220. In this way, a user can browse the abstract of each
document identified by the query expression. In an
alternative embodiment, an abstract from one of the
identified documents could be presented automatically upon
completion of a search for documents meeting the query
expression. In either case, the document from which the
abstract presented in abstract pane 270 is drawn may be
opened by clicking the [...].

In one embodiment of the present invention, the user is
permitted to create multiple instances of dialog box 201,
each presenting a list of keyword phrases, a list of
documents and an abstract based on the same or different
archive catalog as the present dialog box 201. Also, several
of the panes within dialog box 201, including keyword pane
205, file list pane 220, search pane 240 and abstract pane
270, are resizeable to permit more or less information to be
presented therein.

Linguistic Analysis

In the preferred embodiment of the present invention a
commercially available linguistic analysis tool named
Syntactica from Iconovex Corporation is used to
linguistically analyze documents. Other linguistic analysis
tools, including tools from Inference Corporation and others,
may also be used. Linguistic analysis tools fall generally
into one of two categories: referential analyzers and
mathematical analyzers.

Referential analyzers, including Syntactica, perform
paragraph by paragraph parsing of documents using
dictionary definitions of words to identify grammatically and
definitionally significant phrases (i.e., keyword phrases).
Grammatically significant phrases are identified on the basis
of syntactic analysis, in which syntactically necessary, but
conceptually insignificant terms (such as conjunctions,
articles, etc.) are removed. Identification of definitionally
significant phrases is termed semantic analysis and involves
reference to the dictionary definition of the terms
constituting the phrase. Based on their grammatical and
definitional significance relative to one another, the keyword
phrases are assigned relevance codes. Syntactica, for
example, assigns relevance codes from 1 through 6 to
identified keyword phrases, with 6 indicating highest
relevance. In one embodiment of the present invention, these
relevance codes are listed with the keyword phrases to which
they refer.

Referring to FIG. 2, the number of keyword phrases
presented in the keyword pane 205 may be controlled by
filtering the keyword phrases presented based on relevance
code. Relevance rank selection buttons 217 are provided for
this purpose. Clicking on the relevance rank selection button
numbered "6", for example, results in the presentation of a
highly selective and therefore reduced number of keyword
phrases, each having a relevance code of 6. Clicking on the
relevance rank selection button numbered "1", by contrast,
results in a less selective, more extensive listing of keyword
phrases having relevance codes of 1 or greater.

Based on the same linguistic analysis described above,
Syntactica identifies concept sentences. Concept sentences
are sentences containing keyword phrases. As with keyword
phrases themselves, the selectivity with which concept sen-
tences are defined may be controlled by user selection of a
relevance filter by clicking a desired one of the abstract
relevance rank selection buttons 275. By combining the
identified concept sentences, an abstract of the document
may be obtained.

Mathematical analyzers perform linguistic analysis by
measuring the relative frequency of occurrence of words
after they have been converted to stemmed words. A
stemmed word is one which has been reduced to its root
form by removing inflectional elements and otherwise
truncating declensional and conjugative forms of the words
(for example, reducing "shipped" to "ship", "devices" to
"device" or "president's" to "president"). Those stemmed
words or groups of stemmed words having a relatively high
frequency of occurrence (i.e., high frequency of occurrence
compared to other stemmed words), are considered to be
keyword phrases. Relevance codes can be assigned to the
stemmed words based on their relative frequency of
occurrence.

Regardless of whether the referential or mathematical
linguistic analyzer is used to parse documents, documents
may first need to be converted from a specialized format into
a format recognizable by the linguistic analysis tool. In one
embodiment of the present invention, for example, certain
types of data files are first converted to the standard file
format known as "ASCII Plain Text" (ASCII) before being
linguistically analyzed by Syntactica.

[...] be a modem, a network adapter module or any other
device for connecting to a computer network.

In the preferred embodiment, the individual steps of the
method of the present invention are performed by the above
described general purpose computer components
programmed with instructions that cause the processor 402 to
perform the recited steps. However, the steps of the method
of the present invention may also be performed by specific
hardware components that contain hard-wired logic for
performing the recited steps, or any combination of
programmed general purpose computer components and
custom hardware components. Nothing disclosed herein
should be construed as limiting the present invention to a
single embodiment wherein the recited steps are performed
by a specific combination of hardware components.

Preferred Embodiment of the Improved User-interface

FIG. 5 depicts an improved user-interface 500 allowing a
computer user to identify one of a plurality of documents in
accordance with the method described above. The
user-interface 500 is presented to the user in response to a
document select request such as a request to open a data file
in a word processing or other text-intensive application. In
the present invention, therefore, the user-interface 500
includes an automatically generated list of search terms,
referred to as key words and key phrases, displayed in a
[...]

b keyword pane 501. The present invention improves upon the 

A Computer System for Performing the Method of keyword pane 205 shown in FIG. 2 and described above. In 

the Present Invention 30 ^ e P resent invention, keyword pane 501 includes a key 

phrase area 514, a key word area 510, a tabbed index 512, 

In one embodiment of the present invention, an apparatus and buttons 518, 520, and 522 for configuring the display of 

for performing the method steps described above includes information in keyword pane 501. The listed key words in 

the computer system 400 shown in FIG. 4. The present key word area 510 and key phrases in key phrase area 514 

invention may be implemented on a general purpose 35 act to "prompt" the user to search for information of interest 

microcomputer, such as one of the members of the Apple without requiring the user to explicitly conceive search 

Macintosh family, one of the members of the IBM Personal terms. Thus, the present invention relieves the user from the 

Computer family, or one of several work-station devices burden of creating a document search expression, and 

which are presently commercially available. In any event, a instead permits the user to browse the relevant documents on 

computer system as may be utilized by the preferred 40 the basis of the automatically generated key words and key 

embodiment generally comprises a bus 401 for communi- phrases. The key words and key phrases listed in areas 510 

eating information, a processor 402 coupled with said bus and 514 prompt the user to construct a query expression via 

401 for processing information, a random access memory a drag and drop technique which is used to identify selected 

(RAM) or other storage device 403 (commonly referred to ones of the plurality of documents. As described above, any 

as a main memory) coupled with said bus 401 for storing 45 of the key words or key phrases shown areas 514 and 510 

information and instructions for said processor 402, a read may be dragged and dropped into search pane 240 shown in 

only memory (ROM) or other static storage device 404 FIGS. 2 and 5. Each of the areas 514 and 510 are separately 

coupled with said bus 401 for storing static information and scrollable using conventional vertical scroll bars 516 and 

instructions for said processor 402, a data storage device 517, respectively. 

405, such as a magnetic disk and disk drive, coupled with 50 Each of the areas 510 and 514 are dynamically sized to 

said bus 401 for storing information and instructions, an render visible the selected portion of the key words or key 

alphanumeric input device 406 including alphanumeric and phrases. The areas 510 and 514 are separated by a dynami- 

other keys coupled to said bus 401 for communicating cally placed separator 511. The position of separator 511 

information and command selections to said processor 402, varies depending upon whether either or both key words 

a cursor control device 407, such as a mouse, track-ball, 55 and/or key phrases have been selected for display using 

cursor control keys, etc., coupled to said bus 401 for buttons 518 and 520. If key words are selected for display 

communicating information and command selections to the in area 510 using button 520, the separator 511 shifts left to 

processor 402 and for controlling cursor movement, and a enlarge the display area 510 available for the display of key 

display device 409 for receiving display data from the words. An example of this is shown in keyword pane 601 

processor 402 and presenting the display data to the com- 60 illustrated in FIG, 5. Separator 511 may also be shifted all 

puter user. Additionally, it is useful if the system includes a the way over to the left margin of pane 501 thereby 

hardcopy device 408, such as a printer, for providing per- displaying only key words and selectively suppressing the 

manent copies of information. The hardcopy device 408 is display of key phrases. If key phrases are selected for 

coupled with the processor 402 through bus 401. display in area 514 using button 518, the separator 511 shifts 

Computer system 400 also includes a computer network 65 right to enlarge the display area 514 available for the display 

access device 411 for connecting to a computer network of key phrases. Separator 511 may also be shifted all the way 

such as the Internet. Computer network access device may over to the right margin of pane 501 thereby displaying only 
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key phrases and selectively suppressing the display of key 
words. The width of each of the areas 510 and 514 is 
dynamically adjusted based upon the width of the key words 
or key phrases currently being displayed in these areas. 
Thus, the text content is used to determine the display area $ 
size. Specifically, the width of area 514 is set to the width of 
the longest key phrase currently being displayed in area 514. 
An example of this is shown in FIG. 5. Once the width of 
area 514 is set based upon its content, the width of area 510 
may be determined. Given the area 510 left over in keyword 10 
pane 501 after the width of area 514 is determined, key 
words may be displayed in a dynamically-created multi- 
column format to consume the available area 510. The width 
of each of the columns in this area is dynamically set to the 
width of the longest key word currently being displayed in 15 
that column. If areas 514 and 510 cannot be dynamically 
sized wide enough to render visible a long key word or key 
phrase, horizontal scroll bars are automatically inserted to 
render area 510 or 514 as a virtual view area into the key 
word or key phrase data. Additionally, the font of the text 2 q 
displayed in areas 510 and 514 can be dynamically modified 
to efficiently use the display area provided in these areas 
given the text content that must be displayed. 
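
The content-driven sizing of the key phrase and key word areas can be illustrated with a minimal Python sketch. It is not the patented implementation: the function name `layout_keyword_pane`, the use of character counts as the unit of width, the one-character column padding, and the rule that overflowing columns set a scroll flag are all assumptions made for illustration.

```python
def layout_keyword_pane(pane_width, key_phrases, key_words, rows):
    """Hypothetical sketch: size the phrase area and word columns from content.

    Widths are measured in characters (an assumption; a real display
    would measure rendered text). Returns a tuple of
    (phrase_area_width, word_column_widths, needs_horizontal_scroll).
    """
    # The phrase area is as wide as the longest phrase currently shown.
    phrase_width = max((len(p) for p in key_phrases), default=0)
    needs_scroll = phrase_width > pane_width
    phrase_width = min(phrase_width, pane_width)

    # Whatever is left over is available for the key word columns.
    remaining = pane_width - phrase_width

    # Fill columns top to bottom, `rows` key words per column.
    columns = [key_words[i:i + rows] for i in range(0, len(key_words), rows)]
    widths, used = [], 0
    for column in columns:
        # Each column is as wide as its longest word, plus one pad character.
        width = max(len(word) for word in column) + 1
        if used + width > remaining:
            needs_scroll = True  # remaining columns are reachable by scrolling
            break
        widths.append(width)
        used += width
    return phrase_width, widths, needs_scroll


# Two phrases, four words, two rows per column, in a 40-character pane.
print(layout_keyword_pane(40, ["luggage tag", "garment bag"],
                          ["bag", "case", "tote", "strap"], rows=2))
# (11, [5, 6], False)
```

Measuring in characters keeps the sketch free of any GUI toolkit; the same structure would apply with pixel widths supplied by a font-metrics call.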

Beside each keyword and key phrase presented in areas 
510 and 514, a relevance code 208 (shown in FIG. 2) of the
keyword or key phrase may be selectively displayed. Button 
522 is used to toggle on/off the display of this numerical 
information. As stated above, the key words and key phrases 
of the areas 510 and 514 are obtained by linguistically 
analyzing each of a plurality of documents and, in the
preferred embodiment, relevance codes 208 are generated 
by the linguistic analysis. 
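
One way relevance codes could be derived by a frequency-based (mathematical) analysis is sketched below in Python. The stemmer is a deliberately crude suffix stripper written only to reproduce the "shipped", "devices" and "president's" examples; it is not Syntactica's algorithm. The five-level code scale and the names `stem` and `relevance_codes` are assumptions, and no stop-word filtering is attempted.

```python
import re
from collections import Counter


def stem(word):
    """Crude illustrative stemmer (an assumption, not a real algorithm)."""
    w = word.lower()
    if w.endswith("'s"):
        w = w[:-2]                      # "president's" -> "president"
    if w.endswith("ed") and len(w) > 4:
        w = w[:-2]                      # "shipped" -> "shipp"
        if len(w) >= 2 and w[-1] == w[-2]:
            w = w[:-1]                  # "shipp" -> "ship"
    elif w.endswith("s") and not w.endswith("ss") and len(w) > 3:
        w = w[:-1]                      # "devices" -> "device"
    return w


def relevance_codes(text, levels=5):
    """Map each stemmed word to a code from 1 to `levels`.

    The code is proportional to the stem's frequency relative to the
    most frequent stem; the scale itself is hypothetical.
    """
    words = re.findall(r"[A-Za-z]+(?:'s)?", text)
    counts = Counter(stem(w) for w in words)
    top = max(counts.values(), default=0)
    return {s: max(1, round(levels * n / top)) for s, n in counts.items()}


codes = relevance_codes("Ship the devices. The device shipped. President's device.")
print(codes["device"], codes["ship"], codes["president"])  # 5 3 2
```

A relevance filter such as the rank selection buttons would then keep only stems whose code meets the selected threshold.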

Keyword pane 501 includes a tabbed index 512, which is used to select
for display the key words or key phrases beginning with the letters or
numbers on a corresponding selected tab of tabbed index 512. Referring
again to FIG. 5, a tabbed index 512 is shown. Each tab of tabbed index
512 includes an alphanumerical symbol or symbols that correspond to the
first letter of key words or key phrases displayable in keyword pane
501. Any one tab of tabbed index 512 may be selected using a
conventional pointing device or mouse. Upon selection of a tab, the
alphanumerical symbol on the tab is used as a search symbol. The key
words and key phrases are searched for the first occurrence of a
matching key word or key phrase that begins with the search symbol. If
found, the matching key word or key phrase is displayed in area 514 for
a matching key phrase and in area 510 for a matching key word. In one
embodiment, the matching key word or key phrase is displayed at the top
or first line in the area 510 or 514 and subsequent key words or key
phrases are filled in beneath the first line. In an alternative
embodiment, the matching key word or key phrase is displayed centered
at the line in the middle of the area 510 or 514 and previous key words
or key phrases are filled in above the matching centered key word or
key phrase and subsequent key words or key phrases are filled in
beneath the matching centered key word or key phrase. If a tab includes
more than one alphanumeric character in a character sequence, the first
alphanumeric character in the sequence is used as the search symbol.
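
The tab selection behavior described above can be sketched in Python as follows. The function name `select_tab`, the case-insensitive comparison, and the representation of a display area as a slice of a sorted list are assumptions made for illustration.

```python
def select_tab(entries, tab_label, visible_lines, centered=False):
    """Hypothetical sketch of tab selection in the keyword pane.

    `entries` is the sorted list of key words (or key phrases). The first
    character of the tab label is the search symbol. Returns the entries
    that would be visible, or None when nothing matches.
    """
    symbol = tab_label[0].lower()
    match = next(
        (i for i, entry in enumerate(entries) if entry.lower().startswith(symbol)),
        None,
    )
    if match is None:
        return None
    if centered:
        # Alternative embodiment: the match sits on the middle line.
        start = max(0, match - visible_lines // 2)
    else:
        # The match sits on the first line of the area.
        start = match
    return entries[start:start + visible_lines]


words = ["airline", "bag", "carry-on", "duffel", "duty free", "garment bag", "luggage"]
print(select_tab(words, "D-F", 3))                 # ['duffel', 'duty free', 'garment bag']
print(select_tab(words, "D-F", 3, centered=True))  # ['carry-on', 'duffel', 'duty free']
```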

In the example shown in FIG. 5, a tab 513 labeled "D" has been selected
by a user. In this case, the letter "D" becomes the search character.
In response to this selection, the present invention has searched the
set of previously generated key words and has displayed the first
matching key word beginning with the search symbol "D" in area 510. In
this example, the matching key word is displayed in the first line of
the area 510. Also in response to the selection, the present invention
has searched the set of previously generated key phrases and has
displayed the first matching key phrase beginning with the search
symbol "D" in the middle line of area 514. Other key phrases are filled
in around the matching key phrase. In addition, the portion of a line
segment displayed underneath the tabbed index 512 at the selected tab
513 is removed to indicate this tab has been previously selected. In
areas 514 and 510, horizontal line segments are inserted in the text to
mark the transition between groups of key words or key phrases having a
common first symbol to a next group of key words or key phrases having
a next common first symbol. In the preferred embodiment, the key words
and key phrases are sorted alphanumerically.

The alphanumerical symbol or symbols on the tabs of 
tabbed index 512 are dynamically generated based upon the 
content of the key words or key phrases they represent. 
These tab symbols are dynamically generated from the key 
word and key phrase content in the following manner. 

First, the key word and key phrase content is scanned to
determine the first alphanumeric character appearing for
each key word and key phrase. Next, the total number of key
words and key phrases beginning with the same alphanumeric
character is tallied for each alphanumeric character.
The average number of key words and key phrases begin- 
ning with the same alphanumeric character is then com- 
puted. Groups of sequential alphanumeric characters are 
collected such that the total number of key words and key 
phrases beginning with the alphanumeric characters from 
the group approaches the average previously computed. In 
some cases, a single alphanumeric character may have 
enough key words and key phrases beginning with that 
alphanumeric character that the total for that alphanumeric 
character approaches the average previously computed. In 
other cases, a group of alphanumeric characters must be 
collected to have enough key words and key phrases begin- 
ning with those alphanumeric characters so the total for that 
group of alphanumeric characters approaches the average 
previously computed. Once these single alphanumeric char- 
acters or groups of alphanumeric characters are determined, 
the single alphanumeric character symbol or symbols rep- 
resenting the groups of alphanumeric characters are inserted 
into the tabs of the tabbed index shown by example in FIGS. 
5-7. 
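
The tab label generation steps above can be sketched in Python. The text says only that a group's total "approaches" the average, so this sketch assumes a group is closed as soon as its running total reaches or exceeds the average. The function name `build_tab_labels` and the "first-last" label format for a group are also assumptions.

```python
from collections import Counter


def build_tab_labels(entries):
    """Hypothetical sketch: derive tab labels from key word/phrase content."""
    # Tally entries by their first alphanumeric character.
    counts = Counter(e[0].upper() for e in entries if e and e[0].isalnum())
    if not counts:
        return []
    # Average number of entries per distinct first character.
    average = sum(counts.values()) / len(counts)

    def label(group):
        return group[0] if len(group) == 1 else f"{group[0]}-{group[-1]}"

    labels, group, total = [], [], 0
    for ch in sorted(counts):
        group.append(ch)
        total += counts[ch]
        if total >= average:   # assumed meaning of "approaches the average"
            labels.append(label(group))
            group, total = [], 0
    if group:                  # leftover characters form a final tab
        labels.append(label(group))
    return labels


print(build_tab_labels(["apple", "avocado", "apricot", "banana",
                        "cherry", "date", "dill", "dough"]))
# ['A', 'B-C', 'D']
```

In the example, "A" and "D" each carry three entries against an average of two, so each gets its own tab, while "B" and "C" are merged.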

In an alternative embodiment of the present invention, the 
archive catalog may be a collection of documents residing at 
arbitrary sites on the World Wide Web (WWW). These 
documents or pages may be accessed and referenced using 
their conventional Universal Resource Locator (URL). 
Referring now to FIG. 6, a web page list 610 is included in 
window 605. Web page list 610 includes a URL for each of 
the WWW resident documents in the archive catalog for the 
present invention. In the manner described above, the key 
words and key phrases of the areas 510 and 514 are obtained 
by linguistically analyzing each of the plurality of docu- 
ments from the archive catalog. In this alternative 
embodiment, these documents are web pages identified in 
web page list 610. In a manner similar to the linguistic 
analysis performed on locally resident files, the web pages 
are scanned for key words and key phrases. These Web 
resident key words and key phrases are then displayed in 
prompted keyword pane 611. The keyword pane 611 oper- 
ates in the same way as keyword pane 501 described above 
in connection with FIG. 5. 

The URLs displayed in web page list 610 are organized in a hierarchical
fashion. In a manner similar to the conventional hierarchical
organization of documents or files within folders or directories, the
present invention displays a hierarchical organization of web pages
within web sites. The full list of web pages for a particular web site
may be expanded and displayed in area 610 by selecting the boxed plus
sign symbol provided in one embodiment of the present invention.

Referring now to FIG. 7, the present invention also includes a concept
editor. The concept editor is used to create a hierarchy in the
specification of search terms or key words and key phrases. Using the
concept editor of the present invention, a set of related key words or
key phrases may be grouped together under a single concept identifier.
The concept identifier may then be used to specify a search for any of
the related key words or key phrases that the concept identifier
represents.

FIG. 7 illustrates a window 701 which is used to control the concept
editor. Window 701 includes a keyword pane 705. Keyword pane 705, as
described above, provides a means for displaying and indexing into a
plurality of key words and key phrases associated with a collection of
archive documents or WWW pages. Any one or more of these key words and
key phrases may be selected, dragged, and dropped into other display
areas using conventional means. Window 701 also includes a concept
specification area 715 including an "include" area 720 and an "exclude"
area 725. These areas are used for specifying the items included or
excluded from the set of related key words or key phrases grouped
together under a single concept identifier. These areas are used in the
manner described below.

Window 701 also includes a dialog box 710 with which a user may enter
the name of a concept identifier that represents the set of related key
words or key phrases grouped together under the specified name. In the
example of FIG. 7, a user has entered the concept identifier name
"Motorcycles". The user may now drag and drop key words or key phrases
from keyword pane 705 into either include area 720 or exclude area 725.
In this example, it is anticipated that a user would drag and drop text
items related to the concept identifier name "Motorcycles", perhaps
make/model information or specifications for specific types of
motorcycles. Items dropped into area 720 will qualify a subsequent
search to require matching text include one or more of these items.
Items dropped into area 725 will qualify a subsequent search to require
matching text not include any of these items. In this manner, a complex
keyword query may be specified and represented by the concept
identifier. In a subsequent search of archive documents or WWW pages, a
user need only enter the concept identifier and the query it represents
is automatically configured.

Concept identifiers may also be hierarchically created. A previously
created concept identifier may be dragged and dropped into the
specification area 715 of a subsequently created concept identifier. In
this manner, the specification of a concept identifier may include
other concept identifiers. For example, a user may create a concept
identifier "Motor Vehicles". The previously created concept identifier
"Motorcycles" may be dragged and dropped into area 720 when the concept
identifier "Motor Vehicles" is created. Other key words, key phrases,
or concept identifiers may be dragged and dropped in to area 715 as
well. Concept identifiers may thereafter be dragged and dropped into
search pane 240. Thus, a very complex and hierarchical query structure
may be created using the concept editor of the present invention.

In addition, the concept editor of the present invention also allows
the creation of logical expressions or query expressions which can
include key words, key phrases, and other previously defined concept
identifiers. The key words, key phrases, and other concept identifiers
that define a concept identifier may be combined into a logical
expression using "AND", "OR", and "NOT" operators. These operators are
well known to those of ordinary skill in the art. The concept
identifier may therefore be used to represent a logical expression.

The concept identifier and the logical expression that it represents
may be conveniently used for document search and query operations.

There are many applications for the concept identifier feature of the
present invention. For example, one of the important features of the
Internet is subscription to various alt. newsgroup services. A
newsgroup subscriber receives periodic updates through electronic mail.
The concept editor of the present invention may be used to create a
compound concept identifier representing a logical expression that
defines the particular newsgroup content of interest to a particular
subscriber. Using this concept identifier, the subscriber may
conveniently browse for his/her specific areas of interest or an
automatic browse and capture function may be activated.

Improved Method for Obtaining an Initial Document Set

As stated above, the method of the present invention can be applied to
analyze documents on the World Wide Web (the "web"). The World Wide Web
is a vast collection of documents, called web pages, that have been
formatted in Hypertext Markup Language (HTML) and linked together using
an HTML construct called hypertext. Hypertext is a character string
accompanied by a Universal Resource Locator (URL, described above).
Computer programs known as "browsers" can be used to view web pages and
allow users to dereference hypertext links to "travel" to the web page
indicated by the link's URL. From the perspective of the browser user,
the World Wide Web is an enormous data object that can be viewed one
web page at a time by following hypertext links.

A fundamental characteristic of the web is that its linked web pages
are distributed among a large number of independently-controlled,
networked computers referred to as "web sites". As a result, the vast
amount of data on the web has virtually no organizational structure
beyond that of individual web pages.

To make information on the web more accessible, a number of web sites
include search engines that can be used to find web pages containing
text consistent with a search expression. A search engine is a computer
program which, when executed, accepts a search expression entered by a
remote user (usually through a browser), then inspects web pages
looking for content consistent with the search expression. If a web
page contains text consistent with the search expression, the URL of
the web page is logged in the search engine and ultimately returned to
the remote user. In many cases the search expression is simply a
character string, but the search expression may also include Boolean
operators (AND, OR, NOT).

Even with the benefit of a search engine, a computer-user browsing the
web can spend hours sifting through web page content before happening
upon something of interest. This is especially true when the user has
only a broad idea of the information sought. For example, suppose one
is interested in luggage and decides to look for descriptions of
luggage on the web. The first step would be to submit the term
"luggage" to a search engine and wait for the search engine to return
URLs. At this point the user is faced with reading through potentially
hundreds of web pages only a few of which may contain luggage
descriptions. Moreover, depending on the volume of traffic on the web
and the number of URLs the search engine is configured to find in a
given search, there can be a significant delay while the search engine
completes its search.

The present invention can be used to much more effi- 
ciently browse the content of web pages thereby allowing 
users to quickly focus on pages of interest. 

FIG. 8 illustrates a method for identifying one of a 
plurality of web pages on the World Wide Web. At step 805, 
a computer user is prompted to construct a search expres- 
sion. The user may either construct the expression from a 
previously generated list of keyword phrases or simply type 
the expression. For example, FIG. 9 depicts a Control 
window 900 used to display a search expression constructed 
by the user, in this case, the term "luggage". In one
embodiment of the present invention, the Control window 
900 includes a history button 902 that can be used to recall 
previously constructed search expressions. 

Returning to method 800, at step 810, the search expres- 
sion is communicated to a plurality of search engines located 
at remote web sites. Then, at step 815, the search engines are 
each prompted to concurrently inspect web pages to identify 
an initial set of web pages containing text consistent with the 
search expression and to return a respective URL for each of 
the identified web pages. By prompting multiple search 
engines to concurrently identify respective sets of web 
pages, the search engines are made to process search 
requests in parallel to accelerate the web search. 

It will be appreciated that the steps 810 and 815 of method 
800 can be performed sequentially or in an interleaved 
manner. That is, the search expression could be communi- 
cated to each of the search engines before any one of the 
search engines is prompted to perform the search, or each 
search engine could be prompted to perform the search 
immediately after receiving the search expression and before 
the search expression is communicated to the next search 
engine. Either way, so long as searching is performed by the 
prompted search engines concurrently, the advantage of 
parallel processing is achieved. 
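
Steps 810 and 815 in their interleaved form can be sketched in Python with a thread pool. Each search engine is modeled as a plain callable that takes the search expression and returns a list of URLs; that interface, the function name `search_all`, and the stub engines in the usage example are hypothetical, and no real search engine API is implied.

```python
from concurrent.futures import ThreadPoolExecutor


def search_all(engines, expression):
    """Hypothetical sketch: query every engine concurrently.

    `engines` maps an engine name to a callable that accepts the search
    expression and returns a list of URLs. Submitting a task both
    communicates the expression and prompts the search, which corresponds
    to the interleaved variant described above.
    """
    with ThreadPoolExecutor(max_workers=len(engines) or 1) as pool:
        futures = {name: pool.submit(fn, expression) for name, fn in engines.items()}
        # Collect results only after every search has been started.
        return {name: future.result() for name, future in futures.items()}


# Stub engines standing in for remote search sites.
stubs = {
    "engine_a": lambda q: [f"//a.example/{q}", f"//shared.example/{q}"],
    "engine_b": lambda q: [f"//shared.example/{q}", f"//b.example/{q}"],
}
print(search_all(stubs, "luggage"))
```

Because all tasks are submitted before any result is awaited, a slow engine delays only its own entry rather than the start of the other searches.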

As discussed above, the present invention may be embod- 
ied in program code that can be executed by a processor. In 
one embodiment of the present invention, a number of views 
of information are made available to the user in windows 
displayed by an executing application program. These views 
are an alternative to several of the data presentation tech- 
niques discussed above. 

FIG. 10 depicts a Contents View window 1005 used to 
display URLs returned by the search engines prompted in 
step 815 of method 800. Redundantly returned URLs (i.e., 
URLs already found by another search engine) are removed 
so that a filtered and relatively comprehensive set of web 
pages is identified and presented in Contents View 1005. The 
set of web pages corresponding to the URLs presented in 
Contents View 1005 is referred to as the initial set of web 
pages. Contents View 1005 can be scrolled in the conven- 
tional manner using scrollbar 1007 to view URLs below the 
virtual window and is selected by clicking Contents tab 
1020. In one embodiment of the present invention, each 
URL (identified by a text string beginning with "//") is
displayed adjacent the title information taken from the 
corresponding web page. For example, the URL 
"//moriluggage.com/" 1008 is displayed adjacent the web 
page title "Mori Luggage Gifts". Also, a "+" icon (e.g., icon
1013) is displayed adjacent each URL to indicate that web
page sentences and titles matching the search expression
may be viewed. When clicked with a mouse or similar cursor
control device, the "+" icon is changed to a "-" icon (e.g.,
icon 1015) and web page text matching the search expression
is displayed as indicated by 1019.
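
The filtering of redundantly returned URLs can be sketched in Python as an order-preserving merge. The function name `merge_results` is hypothetical, and URLs are compared as exact strings after trimming whitespace, which is an assumption; the text does not say how two URLs are judged to be the same.

```python
def merge_results(results_by_engine):
    """Hypothetical sketch: merge per-engine URL lists, dropping repeats.

    A URL already returned by an earlier engine is skipped, so each page
    appears once in the initial set, in first-seen order.
    """
    seen = set()
    merged = []
    for urls in results_by_engine.values():
        for url in urls:
            key = url.strip()
            if key not in seen:
                seen.add(key)
                merged.append(key)
    return merged


print(merge_results({
    "engine_a": ["//moriluggage.com/", "//bags.example/"],
    "engine_b": ["//bags.example/", "//travel.example/"],
}))
# ['//moriluggage.com/', '//bags.example/', '//travel.example/']
```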

Returning to the method of FIG. 8, at step 820, each of the web pages
in the initial set of web pages is linguistically analyzed to identify
keyword phrases therein. In one embodiment of the present invention,
this is accomplished by downloading and linguistically analyzing the
contents of each web page concurrently with the ongoing search
initiated in step 815. In step 825, the computer user is prompted to
construct a query expression in which at least one of the keyword
phrases is an operand, and in step 830, the query expression is used to
identify one web page of the initial set of web pages.

As shown in FIG. 11, the keyword phrases extracted from each analyzed
web page are displayed in a navigable cross-index in a Phrases View
window 1105 that includes a set of alphabetical tabs 1110 to allow a
user to select a virtual window into the overall list of keyword
phrases according to the first letter of the keyword phrase of
interest. Herein the expression "cross-index" refers to an alphabetized
listing of references found in more than one document. The index
displayed in Phrases View 1105 is a cross-index because it contains
keyword phrases found in more than one of the analyzed web pages. In
one embodiment of the present invention, keyword phrases found in more
than one web page are displayed in a different color than those found
in only one web page. This allows the user to quickly identify common
themes among the initial set of web pages. The user is able to navigate
the index of keyword phrases either by clicking one of the alphabetical
tabs 1110 or by clicking one of the indexed keyword phrases. If the
user clicks a keyword phrase indicated to have been found in only one
of the analyzed web pages (i.e., a unique keyword phrase), an abstract
of the corresponding web page is presented to the user in an abstract
view (discussed below). In one embodiment of the present invention, the
abstract has been previously generated based on linguistic analysis of
the web page. If the user clicks a non-unique keyword phrase (i.e., a
keyword phrase found in more than one of the analyzed web pages), a
list is presented identifying web pages in which the keyword phrase has
been found. In one embodiment of the present invention, the user may
select a web page from the list by moving the mouse cursor over a
listed web page. A previously generated abstract corresponding to the
selected web page is then displayed. The Phrases View 1100 is selected
by clicking the Phrases tab 1120.
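
The cross-index and the two click behaviors can be sketched in Python. The names `build_cross_index` and `on_phrase_click`, the dictionary shapes, and the tagged return values are assumptions for illustration; the abstracts are taken as already generated.

```python
from collections import defaultdict


def build_cross_index(phrases_by_page):
    """Hypothetical sketch: map each keyword phrase to the pages containing it.

    `phrases_by_page` maps a URL to the keyword phrases found on that page.
    The result is ordered alphabetically by phrase, ignoring case.
    """
    index = defaultdict(set)
    for url, phrases in phrases_by_page.items():
        for phrase in phrases:
            index[phrase].add(url)
    ordered = sorted(index.items(), key=lambda item: item[0].lower())
    return {phrase: sorted(urls) for phrase, urls in ordered}


def on_phrase_click(index, abstracts, phrase):
    """Return what the view would show when a phrase is clicked."""
    pages = index[phrase]
    if len(pages) == 1:
        # Unique phrase: show the abstract of the one page containing it.
        return ("abstract", abstracts[pages[0]])
    # Non-unique phrase: show the list of pages to choose from.
    return ("page_list", pages)


pages = {
    "//a.example/": ["garment bag", "leather luggage"],
    "//b.example/": ["leather luggage", "travel case"],
}
abstracts = {"//a.example/": "Abstract of page A", "//b.example/": "Abstract of page B"}
index = build_cross_index(pages)
print(on_phrase_click(index, abstracts, "garment bag"))      # ('abstract', 'Abstract of page A')
print(on_phrase_click(index, abstracts, "leather luggage"))  # ('page_list', ['//a.example/', '//b.example/'])
```

The length of each page list is also what a display layer would use to choose between the two colors for unique and non-unique phrases.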

FIG. 12 depicts a Words View window 1200 that allows a user to view the
keywords extracted from the initial set of web pages. Words View 1200
is selected by clicking the Words tab 1220. Like the Phrases View 1100,
the Words View 1200 includes an alphabetically tabbed, navigable
cross-index 1210 and keywords found in more than one web page are
displayed in a different color than those found in only one web page.
It will be appreciated that other techniques may be used to distinguish
unique keywords or keyword phrases from non-unique keywords or keyword
phrases without departing from the spirit and scope of the present
invention.

FIG. 13 depicts a Links View window 1300 that allows a user to view a
search tree 1302 resulting from the execution of the method 800 of the
present invention. The initial search term entered in the Control
window 900 is displayed at the root 1305 of the search tree 1302 (in
this case, the term "luggage"). The next branch below the root 1305 of
the search tree 1302 contains search expressions indicating the
user-specified search term and the search engine to which the search
term is to be communicated. An example of this type of expression,
referred to herein as a "search engine expression", is shown at 1307.
Search engine expression 1307 indicates that the term "luggage" is to
be communicated to the AltaVista search engine.

Web pages identified by a search engine are listed below the search
engine expressions in Links View 1300 in hierarchical order. For
example, a first level page containing text consistent with the search
expression and found by the AltaVista search engine is shown at 1309 of
Links View 1300. Similarly, a second level web page found by the
AltaVista search engine by following a hypertext link in the first
level web page is shown at 1311 of Links View 1300. Links View 1300 is
selected by clicking the Links tab 1320.

FIG. 14 depicts a Discards View window 1400 used to display the URLs of
each of the web pages identified by the search engines in step 815 of
method 800 that could not be downloaded. The unavailability of a web
page is indicated by a torn web page icon (e.g., 1405) displayed
adjacent each URL listed in the Discards View 1400. Discards View 1400
is selected by clicking Discards tab 1420.

FIG. 15 depicts an Abstract window 1500 used to display an abstract of
the web page identified in step 830 of method 800. A web page abstract
may also be selected by clicking a web page icon (or URL) in the
Contents View 1000. In one embodiment of the present invention, an
abstract is generated for each web page of the initial set of web pages
and then the web page is discarded. This way, system memory is
conserved. The user may recall the full web page if desired. Each
abstract is generated based on concept sentences identified in the web
page as described above.

In an alternative embodiment, the initial set of web pages can be saved
and then queried in a second level query. For example, a new search
expression may be entered, but rather than searching the web for new
pages related to the search expression, the initial set of web pages
previously obtained

message abstracts. Thus, the present invention provides a computer-user
with a powerful technique for sorting through mail by content without
having to open and read each message. Later, after messages of interest
have been sorted from the rest, they can be opened and read in the
usual manner.

The application program of the preferred embodiment includes a number
of options that can be set by the user to control the generation of the
initial set of web pages and the presentation of data in the various
views. These options are presented in a number of options windows
discussed below.

FIG. 16 depicts a Quick Setup options window. The Quick Setup window
allows a user to describe characteristics of the host computer to allow
an application program to automatically determine certain configuration
parameters. Such configuration parameters include the number of search
engines to be concurrently executed to determine an initial set of web
pages, the maximum number of web agents that can be invoked to manage
search requests and other tasks, and the volume of data displayed in
the information views. In the embodiment depicted in FIG. 16, the
computer-user is prompted to specify the processor speed via slide bar
1605, the amount of core memory via slide bar 1610 and the modem speed
via slide bar 1615. After these characteristics have been specified,
button 1620 is clicked and the configuration parameters considered to
best match the host computer's capabilities are selected. It will be
appreciated that in an alternative embodiment, the application program
could query system resources to determine the host computer's
characteristics.

FIG. 17 depicts a Search options window 1700 that can be used to
specify the web searching engines to be used to identify the initial
set of web pages and to specify the number of web pages to be located
by each search engine in a given search. A list of search engines
appears by default in the search engine selection window 1705. The user
can add to the list and then select from among the listed search
engines by pressing install button 1706 while a listed engine is
highlighted. Each installed search engine will be used to

can be searched using the new search expression. In one identify web pages as described in steps 810 and 815 of 

embodiment of the present invention, previously down- method 800. 

loaded pages matching the search expression are displayed Slide bar 1707 may also be adjusted by the user to indicate 

in the Contents View window (FIG. 10, discussed above) the maximum number of web agents that may be concur- 

while previously downloaded pages not matching the search 45 rently executed to manage search operations and other tasks, 

expression are routed to the Discards View (FIG. 14, dis- Web agents are discussed in greater detail in reference to 

cussed above). This feature of the present invention, referred FIG. 18 and FIG. 19. 

to herein as "document filtering", allows the initial set of Other options that can be specified by the user include 

web pages to be shuffled between the Contents and Discards search filter parameters that can be used to filter web pages 

views with each new search expression, depending on 50 that do not exactly match the search expression from the 

whether the web pages contain expression-matching text. initial set of web pages, verbosity settings for indicating the 

One application for document filtering is electronic mail maximum number of words in a keyword phrase or in an 

sorting. Computer users receive electronic mail from many abstract and settings to control the manner in which text is 

sources (e.g., co-workers, internet contacts, newsgroups) displayed in the various views. 

and in ever-increasing volume. The present invention can be 55 FIG. 18 is a block diagram of an application program 

used to download and analyze electronic mail files stored on 1800 according to one embodiment of the present invention, 

a network mail server in a manner similar to the way web Application program 1800 includes program code execut- 

pages are downloaded and analyzed. In one embodiment of able to provide user-interface 1805, thread manager 1810 

the present invention, different dynamic link libraries are and web agents (1812, 1814, 1816, 1818). As stated above, 

provided to support electronic mail message download from 60 the exact number of web agents is determined by user 

different electronic mail servers. Copies of electronic mail settings. User interface 1805 receives search requests from 

messages are downloaded from the server and then analyzed an application user, and sends the search request to thread 

to generate lists of keyword phrases and keywords, and, for manager 1810, as indicated by arrow 1806. Thread manager 

each mail message, an abstract. The user can then enter 1810 communicates work orders corresponding to the search 

search expressions to shuffle the different mail messages 65 request to an idle one of web agents 1812, 1814, 1816, 1818. 

between the Contents View window and Discards View In one embodiment of the present invention, there are at 

window as described above. The user may also view mail least three types of work orders. The first type of work order 
is a request to resolve a search expression into a number of search
engine expressions. For example, upon receiving the search expression
"search:luggage", a web agent (or the thread manager itself, in an
alternate embodiment) might generate the search engine expressions
"query:AltaVista:luggage", "query:Yahoo:luggage", and others. As
discussed above in reference to FIG. 17, the number and identity of
the search engines for which search expressions are generated is
determined according to user specification. After the search
expression has been resolved into search engine expressions, the
search engine expressions are communicated to the user-interface
portion 1805 of the application program 1800 as indicated by arrow
1820. The user-interface 1805 displays the search engine expressions
in the Links View window (element 1300 of FIG. 13) as discussed above,
then passes the search engine expression to the thread manager 1810
for processing according to a second type of work order.

A second type of work order communicated to web agents 1812, 1814,
1816, 1818 by thread manager 1810 is a request to communicate a search
engine request to a search engine. Since, in the preferred embodiment
of the present invention, web agents 1812, 1814, 1816, 1818 are
independent execution threads (separate executions of the same
instance of program code), multiple web agents can concurrently
communicate search engine requests to respective search engines. Since
substantial time can be spent connecting and traveling to search
engine web sites, parallel operation by web agents can substantially
accelerate the web searching process.

Once the search engines have been prompted to identify web pages
containing text consistent with the search expression, the web agents
1812, 1814, 1816, 1818 continue to communicate with respective search
engines to receive identified URLs. The web agents 1812, 1814, 1816,
1818 communicate received URLs to the user-interface 1805 where they
are displayed in various information windows as discussed above (e.g.,
FIG. 13 Links View, FIG. 10 Contents View). After the initial set of
URLs has been displayed by the user-interface 1805, they are
communicated to the thread manager 1810 for processing according to a
third type of work order.

The third type of work order communicated to the web agents 1812,
1814, 1816, 1818 by thread manager 1810 is a request to retrieve and
analyze web pages. At this point, the parallel execution of the web
agents is particularly beneficial. In most cases search engines do not
perform a web search in response to a query, and instead return URLs
stored in previously recorded logs. Unfortunately search engine logs,
at least in part, can become out of date by days or even weeks. Since
content on the web is ever-changing, search engines often return URLs
to non-existent or relocated web pages. When a web agent attempts to
download such a non-existent or relocated web page, substantial time
may pass before the web agent gives up. If only one web agent were
operating at a time, web page analysis would come to a standstill, at
least temporarily. However, since multiple web agents are concurrently
executed to manage the web searching operation, web page analysis goes
forward rapidly despite occasional inability to locate URL indicated
web pages.

After web agents 1812, 1814, 1816, 1818 have linguistically analyzed
downloaded web pages to extract keyword phrases, keywords and
abstracts, the extracted information is provided to the user-interface
1805 for display in the appropriate view.

FIG. 19 illustrates an execution diagram of one embodiment of the
user-interface 1805 of FIG. 18. At step 1905 input data is received
either from a computer-user, web agents or both. At step 1910 the
input data is displayed in the appropriate window. At decision step
1915 the input data is examined to determine if it indicates that
further processing is required. In the case of a search expression
entered by the user, a query expression returned by a web agent (or
thread manager, as the implementation may be), or a URL returned by a
web agent, further processing will be required and execution proceeds
to step 1920. At step 1920, a thread manager procedure referred to
herein as "GenerateWorkList" is called, passing the input data as one
or more parameters. After procedure GenerateWorkList has been
completed, execution loops back to step 1905 to scan for more input
data. Also, if at decision block 1915, it is determined that the input
requires no further processing, execution of the user interface loops
back to step 1905.

In the preferred embodiment of the present invention, user-interface
code is executed in one thread of a multi-threaded application
program. However, execution of user-interface code in a separate
process of a multi-processed application program or execution of
user-interface code as part of a single-process application program
are considered to be within the spirit and scope of the present
invention.

FIG. 20 is an execution diagram of thread manager procedure
GenerateWorkList. Procedure GenerateWorkList receives one or more
search expressions, search engine expressions or URLs as an input
parameter or parameters and, at step 2005, adds the indicated work
item to a work list. At step 2010, procedure StartWork is called to
issue work orders to web agents according to the work list. After
StartWork is completed, procedure GenerateWorkList returns to its
caller.

FIG. 21 is an execution diagram of thread manager procedure StartWork.
At decision step 2155, a list of web agents is examined to determine
if a web agent is idle. If no idle web agent is found, at step 2160,
the number of existing web agents is compared against a user-defined
maximum number of web agents. If fewer than the maximum allowed number
of web agents exist, then a new web agent is started and marked as
idle at step 2165. As stated above, in the preferred embodiment of the
present invention, web agents are implemented as execution threads.
However, web agents could also be separate processes.

After step 2165, execution of procedure StartWork loops back to
decision step 2155 where the newly started idle web agent is detected.

After an idle web agent is detected at step 2155, step 2170 is
executed to communicate a work order to the idle web agent. The work
order corresponds to an item inserted in the work list by procedure
GenerateWorkList so that, after a work order is issued to a web agent,
the corresponding item is removed from the work list. As discussed
above, in one embodiment of the present invention, the work order is a
request either to generate one or more search engine expressions,
initiate a search by a search engine or download and analyze a URL
indicated web page. Other work orders such as sending one or more
e-mail messages to aid in search engine evaluation or program
debugging are within the spirit and scope of the present invention.

After a work order is sent to the idle web agent in step 2170, the
work list is examined at decision step 2175. If the work list is
empty, procedure StartWork is exited, returning to its caller.
Procedure StartWork is also exited if it is determined at decision
step 2160 that the maximum number of web agents have already been
created.

FIG. 22 is an execution diagram of a web agent. At step 2205, an input
queue is inspected to determine if a work
order has been received. If so, at step 2210 the work order is
executed by either generating one or more search engine expressions,
prompting a search engine to perform a search, or downloading and
analyzing a web page. As stated above, other types of work orders are
possible. When the work order is completed, the results are sent to
the user-interface in step 2215. Then, at step 2220, the web agent
marks itself as idle and, at step 2225, calls thread manager procedure
StartWork. Consequently, if there are additional work items to be
processed, StartWork will communicate another work item to the web
agent. After procedure StartWork is completed, execution loops back to
decision step 2205 to begin polling for work orders. In one embodiment
of the present invention a periodically executed procedure terminates
web agents that have been idle for longer than a predetermined period
of time. Other techniques may be used to terminate web agents,
including self termination after executing step 2205 a threshold
number of times in succession.
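
The interplay of GenerateWorkList, StartWork and the web agent loop described above can be illustrated with a minimal Python sketch. It is not the patented implementation: the class names, the `execute` stub, the `run_orders` helper and the `example.invalid` URLs are all hypothetical, no network access is performed, and the sketch checks for an empty work list before dispatching, a detail the description does not spell out.

```python
import queue
import threading
from collections import deque


class WebAgent(threading.Thread):
    """One worker thread that polls its input queue for work orders."""

    def __init__(self, manager):
        super().__init__(daemon=True)
        self.manager = manager
        self.inbox = queue.Queue()
        self.idle = True  # a newly started agent is marked idle (step 2165)

    def run(self):
        while True:
            order = self.inbox.get()                   # step 2205: wait for an order
            result = self.manager.execute(order)       # step 2210: carry it out
            self.manager.results.put((order, result))  # step 2215: report the result
            with self.manager.lock:
                self.idle = True                       # step 2220: mark self idle
            self.manager.start_work()                  # step 2225: ask for more work


class ThreadManager:
    def __init__(self, execute, max_agents):
        self.execute = execute          # hypothetical callback that performs an order
        self.max_agents = max_agents    # user-defined maximum number of agents
        self.work_list = deque()
        self.agents = []
        self.results = queue.Queue()    # stands in for the user-interface
        self.lock = threading.RLock()

    def generate_work_list(self, *items):
        with self.lock:
            self.work_list.extend(items)  # step 2005: add items to the work list
        self.start_work()                 # step 2010: issue work orders

    def start_work(self):
        with self.lock:
            while self.work_list:         # assumption: nothing to do if list is empty
                agent = next((a for a in self.agents if a.idle), None)  # step 2155
                if agent is None:
                    if len(self.agents) >= self.max_agents:             # step 2160
                        return
                    agent = WebAgent(self)                              # step 2165
                    self.agents.append(agent)
                    agent.start()
                agent.idle = False
                agent.inbox.put(self.work_list.popleft())               # step 2170


def execute(order):
    """Stub for the three work order types; real agents would use the network."""
    kind, payload = order
    if kind == "resolve":  # search expression -> search engine expressions
        term = payload.split(":", 1)[1]
        return [f"query:{engine}:{term}" for engine in ("AltaVista", "Yahoo")]
    if kind == "search":   # would prompt a search engine; placeholder URL only
        return [f"http://example.invalid/{payload}"]
    if kind == "fetch":    # would download and analyze a page
        return f"analyzed {payload}"
    raise ValueError(kind)


def run_orders(orders, max_agents=2):
    manager = ThreadManager(execute, max_agents)
    manager.generate_work_list(*orders)
    results = dict(manager.results.get(timeout=5) for _ in orders)
    return results, len(manager.agents)
```

For example, `run_orders([("resolve", "search:luggage"), ("fetch", "http://example.invalid/a")])` dispatches both orders to separate agent threads and returns the collected results together with the number of agents that were started. A single re-entrant lock guards the agent list and the work list so that two agents finishing at once cannot hand the same work item out twice.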

A method and apparatus for identifying a document based
on keyword phrases automatically extracted from an initial
set of documents is thus described.

What is claimed is: 

1. A method for presenting to a computer-user information 
from web pages containing text consistent with a search 
expression, said method comprising the
computer-implemented steps of:

prompting a computer-user to construct a search expres- 
sion; 

communicating the search expression to a plurality of web
searching engines;

prompting each of the plurality of web searching engines
to concurrently inspect a respective plurality of web
pages and to identify web pages containing text
consistent with the search expression;

linguistically analyzing the identified web pages to obtain 
keyword phrases therefrom; and 

displaying the keyword phrases obtained from the iden- 
tified web pages in a navigable cross-index. 

2. The method of claim 1 wherein said step of displaying
the keyword phrases obtained from the identified web pages
in a navigable cross-index comprises the step of indicating
keyword phrases displayed in the navigable cross-index
found in more than one of the identified web pages.

3. The method of claim 2 wherein said step of indicating
keyword phrases displayed in the navigable cross-index that
have been found in more than one of the identified web
pages comprises the step of displaying keyword phrases that
have been found in more than one of the identified web
pages in a different color than keyword phrases that have
been obtained from only one of the identified web pages.

4. The method of claim 1 further comprising the steps of:

detecting user selection of one of the keyword phrases
displayed in the navigable cross-index;

determining one of the identified web pages from which
the one of the keyword phrases was obtained; and

displaying a web page abstract generated based on
linguistic analysis of the one of the identified web pages.

5. The method of claim 1 wherein said step of
communicating the search expression to a plurality of web
searching engines comprises the step of communicating the
search expression to a number of web searching engines, the
number of web searching engines being determined based
on characteristics of the computer implementing said step of
communicating.

6. The method of claim 5 wherein the number of web
searching engines is determined based on at least one of the
processor speed, modem speed, and memory size
characteristics of the computer implementing said step of
communicating.

7. The method of claim 1 wherein said step of prompting 
each of the plurality of web engines to concurrently inspect 
a respective plurality of web pages comprises the step of 
prompting one of the plurality of web searching engines to 
inspect a number of web pages, the number of web pages 
being based on a parameter entered by the computer-user. 

8. The method of claim 1 further comprising the steps of:

automatically identifying for the computer-user keyword
phrases in an initial set of web pages, the initial set of
web pages being defined by the web pages containing
text consistent with the search expression;

prompting the computer-user to construct a query
expression in which at least one of the keyword phrases is an
operand; and

identifying one of the initial set of web pages based on the
query expression.

9. The method of claim 8 further comprising the step of 
displaying a tabbed index to the keyword phrases. 

10. The method of claim 8 wherein said step of identifying 
keyword phrases in the initial set of web pages comprises the 
step of linguistically analyzing each web page of the initial 
set of web pages to identify the keyword phrases therein. 

11. A computer-readable medium having stored thereon a
plurality of sequences of instructions, said plurality of
sequences of instructions including sequences of
instructions which, when executed by a processor, cause said
processor to:

prompt a computer-user to construct a search expression;

communicate the search expression to a plurality of web
searching engines;

prompt each of the plurality of web searching engines to
concurrently inspect a respective plurality of web pages
and to identify web pages containing text consistent
with the search expression;

linguistically analyze the identified web pages to obtain
keyword phrases therefrom; and

display the keyword phrases obtained from the identified
web pages in a navigable cross-index.

12. The computer-readable medium of claim 11 wherein 
said step of communicating the search expression to a 
plurality of web searching engines, comprises the step of 
communicating the search expression to a plurality of web 
searching engines provided at respective sites on the World 
Wide Web. 

13. A computer system comprising:

a bus;

a processor coupled to said bus;

a user input device coupled to said bus;

a display coupled to said bus;

a computer-network access device coupled to said bus;
and

a memory coupled to said bus, said memory being
readable by said processor and having sequences of
instructions stored therein which, when executed by said
processor, cause said processor to:

prompt a computer-user to construct a search
expression;

communicate the search expression to a plurality of
web searching engines on the World Wide Web via
said computer-network access device;

prompt each of the plurality of web searching engines
to concurrently inspect a respective plurality of web
pages and to identify web pages containing text
consistent with the search expression;

linguistically analyze the identified web pages to obtain
keyword phrases therefrom; and

display the keyword phrases obtained from the
identified web pages in a navigable cross-index.

14. A method for obtaining web pages containing text
consistent with a search expression, said method comprising
the computer-implemented steps of:

prompting a computer-user to construct a search expres- 
sion; 

starting a plurality of web agents to communicate the 
search expression to respective web searching engines; 

concurrently receiving in each of the plurality of web 
agents universal resource locators (URLs) identifying 
respective web pages containing text consistent with 
the search expression; 

linguistically analyzing the identified web pages to obtain 
keyword phrases therefrom; and 

displaying the keyword phrases obtained from the iden- 
tified web pages in a navigable cross-index. 



15. The method of claim 14 wherein said step of starting
a plurality of web agents to communicate the search
expression to respective web searching engines comprises the step
of executing a plurality of execution threads in a
multi-threaded application program.

16. A method for examining electronic mail, said method 
comprising the computer-implemented steps of: 

reading a plurality of electronic mail messages from a
mail server;

linguistically analyzing each of the plurality of electronic
mail messages to identify for a user keyword phrases
therein;

prompting the user to construct a query expression in
which at least one of the keyword phrases is an
operand; and

sorting the plurality of electronic mail messages based on
the query expression.